Abstract: We compare two $f$-divergences and prove that their joint range is the convex hull of the joint range for distributions supported on only two points. Some applications of this result are given.

Index Terms: $f$-divergence, convexity, joint range.



I. Divergences and divergence statistics 

Many of the divergence measures used in statistics are of the $f$-divergence type introduced independently by I. Csiszár [1], T. Morimoto [2], and Ali and Silvey [3]. Such divergence measures have been studied in great detail in 0. Often one is interested in inequalities for one $f$-divergence in terms of another $f$-divergence. Such inequalities are needed, for instance, in order to calculate the relative efficiency of two $f$-divergences when used for testing goodness of fit, but there are many other applications. In this paper we shall study the more general problem of determining the joint range of any pair of $f$-divergences. The results are useful in determining general conditions under which information divergence is a more efficient statistic for testing goodness of fit than another $f$-divergence, but this application will not be discussed in this short paper.

Let $f : (0,\infty) \to \mathbb{R}$ denote a convex function satisfying $f(1) = 0$. We define $f(0)$ as the limit $\lim_{t\to 0} f(t)$. We define $f^*(t) = t f(t^{-1})$. Then $f^*$ is a convex function and $f^*(0)$ is defined as $\lim_{t\to 0} t f(t^{-1}) = \lim_{t\to\infty} f(t)/t$. Assume that $P$ and $Q$ are absolutely continuous with respect to a measure $\mu$, and that $p = dP/d\mu$ and $q = dQ/d\mu$. For arbitrary distributions $P$ and $Q$ the $f$-divergence $D_f(P,Q) \geq 0$ is defined by the formula

$$D_f(P,Q) = \int_{\{q>0\}} f\left(\frac{p}{q}\right) dQ + f^*(0)\, P(q=0) \tag{1}$$

(for details about the definition (1) and properties of the $f$-divergences, see 0, or 0). With this definition

$$D_f(P,Q) = D_{f^*}(Q,P).$$
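For probability vectors, definition (1) reduces to a finite sum plus the boundary term $f^*(0)\,P(q=0)$. The following is a minimal Python sketch of that discrete case, not code from the paper; the function names (`f_divergence`, `tv`, `kl`, `kl_star`) are hypothetical, and the caller is assumed to supply the constant $f^*(0)$ explicitly. It also checks the identity $D_f(P,Q) = D_{f^*}(Q,P)$ for $f(t) = t\ln t$.

```python
import math

def f_divergence(f, f_star_at_0, P, Q):
    """Discrete version of definition (1) for probability vectors P and Q.

    f           -- convex function with f(1) = 0; must accept t = 0 (the limit f(0))
    f_star_at_0 -- the constant f*(0) = lim_{t -> infinity} f(t)/t, may be math.inf
    """
    integral = 0.0   # sum over {q > 0} of q * f(p/q)
    p_outside = 0.0  # P(q = 0)
    for p, q in zip(P, Q):
        if q > 0:
            integral += q * f(p / q)
        else:
            p_outside += p
    # Skip the boundary term when P(q = 0) = 0 so that inf * 0 never occurs.
    return integral if p_outside == 0 else integral + f_star_at_0 * p_outside

def tv(t):
    """f(t) = |t - 1|, for which f*(0) = 1."""
    return abs(t - 1.0)

def kl(t):
    """f(t) = t ln t, with f(0) = 0 and f*(0) = infinity."""
    return t * math.log(t) if t > 0 else 0.0

def kl_star(t):
    """f*(t) = t f(1/t) = -ln t, with f*(0) = infinity and (f*)*(0) = 0."""
    return -math.log(t) if t > 0 else math.inf

P = [0.2, 0.3, 0.5]
Q = [0.4, 0.4, 0.2]
forward = f_divergence(kl, math.inf, P, Q)   # D_f(P, Q)
reverse = f_divergence(kl_star, 0.0, Q, P)   # D_{f*}(Q, P)
```

With these inputs `forward` and `reverse` agree up to rounding, and two singular point masses give an $L^1$-distance of 2 through the boundary term alone.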

Example 1: The function $f(t) = |t - 1|$ defines the $L^1$-distance, which plays an important role in information theory and mathematical statistics [7], [8].




Fig. 1. The joint range of total variation V and information D as determined
in (§). It was also proved that any point in the range 



In (1), the convex function $f$ is often taken to be one of the power functions $\phi_\alpha$ of order $\alpha \in \mathbb{R}$ given in the domain $t > 0$ by the formula

$$\phi_\alpha(t) = \frac{t^\alpha - \alpha(t-1) - 1}{\alpha(\alpha-1)} \quad \text{when } \alpha(\alpha-1) \neq 0 \tag{3}$$

and by the corresponding limits

$$\phi_0(t) = -\ln t + t - 1 \quad \text{and} \quad \phi_1(t) = t \ln t - t + 1. \tag{4}$$

The $\phi$-divergences

$$D_\alpha(P,Q) \overset{\text{def}}{=} D_{\phi_\alpha}(P,Q), \quad \alpha \in \mathbb{R} \tag{5}$$

based on (3) and (4) are usually referred to as power divergences of orders $\alpha$. For details about the properties of power divergences, see or 0. Next we mention the best known members of the family of statistics (5), with a reference to the skew symmetry $D_\alpha(P,Q) = D_{1-\alpha}(Q,P)$ of the power divergences (5).
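The power functions and the skew symmetry $D_\alpha(P,Q) = D_{1-\alpha}(Q,P)$ are easy to check numerically. The sketch below is an illustration only: `phi` and `power_divergence` are hypothetical names, and both vectors are assumed to have strictly positive entries so that no boundary term $f^*(0)$ is needed.

```python
import math

def phi(alpha, t):
    """Power function of order alpha for t > 0, including the two limiting cases."""
    if alpha == 0:
        return -math.log(t) + t - 1
    if alpha == 1:
        return t * math.log(t) - t + 1
    return (t ** alpha - alpha * (t - 1) - 1) / (alpha * (alpha - 1))

def power_divergence(alpha, P, Q):
    """D_alpha(P, Q) = sum_j q_j * phi_alpha(p_j / q_j); assumes all p_j, q_j > 0."""
    return sum(q * phi(alpha, p / q) for p, q in zip(P, Q))

P = [0.2, 0.3, 0.5]
Q = [0.4, 0.4, 0.2]

# Skew symmetry: D_alpha(P, Q) should equal D_{1 - alpha}(Q, P).
gaps = {a: power_divergence(a, P, Q) - power_divergence(1 - a, Q, P)
        for a in (2.0, 3.0, 0.5, -1.0, 0.0, 1.0)}
```

Every entry of `gaps` should vanish up to rounding. For $\alpha = 2$ the value coincides with $\frac{1}{2}\sum_j (p_j - q_j)^2 / q_j$, since $\phi_2(t) = \frac{1}{2}(t-1)^2$.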

0000-0000/00$00.00 © 2010 IEEE 
Fig. 2. Joint range of total variation and Jensen-Shannon divergence. The
2-point achievable pairs have dark shading and the 3-point achievable pairs 
have light shading. 



Example 2: The $\chi^2$-divergence (or quadratic divergence or Pearson divergence)

$$D_2(P,Q) = D_{-1}(Q,P) = \frac{1}{2} \sum_{j=1}^{k} \frac{(p_j - q_j)^2}{q_j} \tag{6}$$

leads to the well known Pearson and Neyman statistics. The information divergence

$$D_1(P,Q) = D_0(Q,P) = \sum_{j=1}^{k} p_j \ln \frac{p_j}{q_j} \tag{7}$$

leads to the log-likelihood ratio and reversed log-likelihood ratio statistics. The symmetric Hellinger divergence

$$D_{1/2}(P,Q) = D_{1/2}(Q,P) = H(P,Q)$$

leads to the Freeman-Tukey statistic.

Example 3: The Hellinger divergence and the total variation are symmetric in the arguments $P$ and $Q$. Non-symmetric divergences may be symmetrized. For instance the LeCam divergence is nothing but the symmetrized $\chi^2$-divergence given by

$$D_{\mathrm{LeCam}}(P,Q) = \frac{1}{2} D_2\left(P, \frac{P+Q}{2}\right) + \frac{1}{2} D_2\left(Q, \frac{P+Q}{2}\right).$$

Another symmetrized divergence is the Jensen-Shannon divergence defined by

$$JD_1(P,Q) = \frac{1}{2} D_1\left(P, \frac{P+Q}{2}\right) + \frac{1}{2} D_1\left(Q, \frac{P+Q}{2}\right).$$

The joint range of total variation with Jensen-Shannon divergence was studied by Briët and Harremoës [9] and is illustrated in Figure 2.

In this paper we shall prove that the joint range of any pair of $f$-divergences is essentially determined by the range of distributions on a two-element set. In special cases the significance of determining the range over two-element sets has been pointed out explicitly in [10]. Here we shall prove that a reduction to two-element sets can always be made.



II. Joint range of $f$-divergences

In this section we are interested in the range of the map $(P,Q) \to (D_f(P,Q), D_g(P,Q))$ where $P$ and $Q$ are probability distributions on the same set.

Definition 4: A point $(x,y) \in \mathbb{R}^2$ is $(f,g)$-achievable if there exist probability measures $P$ and $Q$ on a $\sigma$-algebra such that $(x,y) = (D_f(P,Q), D_g(P,Q))$. An $(f,g)$-divergence pair $(x,y)$ is $d$-achievable if there exist probability vectors $P, Q \in \mathbb{R}^d$ such that

$$(x,y) = (D_f(P,Q), D_g(P,Q)).$$

Lemma 5: Assume that

$$P_0(A) = Q_0(A) = 1$$

and

$$P_1(B) = Q_1(B) = 1$$

and that $A \cap B = \emptyset$. If $P_\alpha = (1-\alpha) P_0 + \alpha P_1$ and $Q_\alpha = (1-\alpha) Q_0 + \alpha Q_1$ then

$$D_f(P_\alpha, Q_\alpha) = (1-\alpha)\, D_f(P_0,Q_0) + \alpha\, D_f(P_1,Q_1).$$
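The additivity in Lemma 5 can be illustrated for finite distributions by placing the two pairs on disjoint blocks of indices. The sketch below is only an illustration under that assumption; `divergence`, `mix_disjoint` and `chi2` are hypothetical helper names, and $P$ is taken to be absolutely continuous with respect to $Q$ so that the $f^*(0)$ term drops out.

```python
def divergence(f, P, Q):
    """sum over {q > 0} of q * f(p/q); valid when P << Q, so no f*(0) term is needed."""
    return sum(q * f(p / q) for p, q in zip(P, Q) if q > 0)

def mix_disjoint(a, V0, V1):
    """(1 - a) * V0 + a * V1 when V0 and V1 live on disjoint index blocks A and B."""
    return [(1 - a) * v for v in V0] + [a * v for v in V1]

def chi2(t):
    """phi_2(t) = (t - 1)^2 / 2."""
    return 0.5 * (t - 1.0) ** 2

P0, Q0 = [0.2, 0.8], [0.5, 0.5]            # supported on block A (2 points)
P1, Q1 = [0.1, 0.3, 0.6], [0.3, 0.3, 0.4]  # supported on block B (3 points)
a = 0.3

lhs = divergence(chi2, mix_disjoint(a, P0, P1), mix_disjoint(a, Q0, Q1))
rhs = (1 - a) * divergence(chi2, P0, Q0) + a * divergence(chi2, P1, Q1)
```

The two values agree because the likelihood ratio on each block is unchanged by the common factor $1-\alpha$ or $\alpha$.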

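Theorem 6: The set of $(f,g)$-achievable points is convex.

Proof: Assume that $(P_0, Q_0)$ and $(P_1, Q_1)$ are two pairs of probability distributions on a space $(\mathcal{X}, \mathcal{F})$. Introduce a two-element set $B = \{0,1\}$ and the product space $\mathcal{X} \times B$ as a measurable space. Let $\phi$ denote the projection on $B$. Now we define a pair $(\tilde{P}, \tilde{Q})$ of joint distributions on $\mathcal{X} \times B$. The marginal distribution of both $\tilde{P}$ and $\tilde{Q}$ on $B$ is $(1-\alpha, \alpha)$. The conditional distributions are given by $\tilde{P}(\cdot \mid \phi = i) = P_i$ and $\tilde{Q}(\cdot \mid \phi = i) = Q_i$ where $i = 0, 1$. The two conditional pairs are supported on the disjoint sets $\mathcal{X} \times \{0\}$ and $\mathcal{X} \times \{1\}$, so Lemma 5, applied with $P_\alpha = \tilde{P}$ and $Q_\alpha = \tilde{Q}$, gives

$$\begin{pmatrix} D_f(\tilde{P},\tilde{Q}) \\ D_g(\tilde{P},\tilde{Q}) \end{pmatrix} = \begin{pmatrix} (1-\alpha) D_f(P_0,Q_0) + \alpha D_f(P_1,Q_1) \\ (1-\alpha) D_g(P_0,Q_0) + \alpha D_g(P_1,Q_1) \end{pmatrix} = (1-\alpha) \begin{pmatrix} D_f(P_0,Q_0) \\ D_g(P_0,Q_0) \end{pmatrix} + \alpha \begin{pmatrix} D_f(P_1,Q_1) \\ D_g(P_1,Q_1) \end{pmatrix}.$$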

Example 7: For the joint range of total variation and Jensen-Shannon divergence illustrated in Figure 2 the set of 2-achievable points is not convex but the set of 3-achievable points is convex and equals the set of all $(f,g)$-achievable points.

Theorem 8: Any $(f,g)$-achievable point is a convex combination of two 2-achievable points. Consequently, any $(f,g)$-achievable point is 4-achievable.

Proof: Let $P$ and $Q$ denote probability measures on a Borel space. Define the set $A = \{q > 0\}$ and the function $X = p/q$ on $A$. Then $Q$ satisfies

$$Q(A) = 1, \qquad \int_A X \, dQ \leq 1. \tag{8}$$

Now we fix $X$ and $A$. The formulas for the divergences become

$$\begin{aligned} D_f(P,Q) &= \int_A f(X)\, dQ + f^*(0)\, P(\complement A) \\ &= \int_A f(X)\, dQ + f^*(0) \left( 1 - \int_A X \, dQ \right) \\ &= \int_A \left( f(X) + f^*(0)(1 - X) \right) dQ \\ &= E\left[ f(X) + f^*(0)(1 - X) \right] \end{aligned}$$

and similarly

$$D_g(P,Q) = E\left[ g(X) + g^*(0)(1 - X) \right].$$

Hence, the divergences only depend on the distribution of $X$. Therefore we may without loss of generality assume that $Q$ is a probability measure on $[0, \infty)$.
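The rewriting of the divergence as an expectation of $f(X) + f^*(0)(1-X)$ under $Q$ can be checked on a small discrete example in which $P$ puts mass outside $A = \{q > 0\}$. This is a sketch under the assumption that $f^*(0)$ is finite; the helper names and the two test functions are illustrative choices, not taken from the paper.

```python
import math

def divergence_by_definition(f, f_star_at_0, P, Q):
    """Integral over A = {q > 0} plus the boundary term f*(0) * P(complement of A)."""
    on_A = sum(q * f(p / q) for p, q in zip(P, Q) if q > 0)
    p_off_A = sum(p for p, q in zip(P, Q) if q == 0)
    return on_A + f_star_at_0 * p_off_A

def divergence_as_expectation(f, f_star_at_0, P, Q):
    """E_Q[f(X) + f*(0) * (1 - X)] with X = p/q on A; requires finite f*(0)."""
    return sum(q * (f(p / q) + f_star_at_0 * (1.0 - p / q))
               for p, q in zip(P, Q) if q > 0)

def tv(t):
    """f(t) = |t - 1|, with f*(0) = 1."""
    return abs(t - 1.0)

def hellinger(t):
    """phi_{1/2}(t) = 2 * (sqrt(t) - 1)^2, with f*(0) = 2."""
    return 2.0 * (math.sqrt(t) - 1.0) ** 2

P = [0.5, 0.2, 0.3]
Q = [0.25, 0.75, 0.0]   # P has mass 0.3 where q = 0, so E[X] = 0.7 < 1

pairs = [(divergence_by_definition(f, c, P, Q), divergence_as_expectation(f, c, P, Q))
         for f, c in ((tv, 1.0), (hellinger, 2.0))]
```

Each tuple in `pairs` holds the same number twice up to rounding, which is the content of the chain of equalities above.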

Define $C$ as the set of probability measures on $[0, \infty)$ satisfying $E[X] \leq 1$. Let $C^+$ be the set of additive measures $\mu$ on $[0, \infty)$ satisfying $\mu(A) \leq 1$ and $\int X \, d\mu \leq 1$. Then $C^+$ is convex and compact under setwise convergence. According to the Choquet-Bishop-de Leeuw theorem [11, Sec. 4] any point in $C^+$ is the barycenter of a probability measure over the extreme points of $C^+$. In particular an element $Q \in C$ is the barycenter of a probability measure $P_{bary}$ over extreme points of $C^+$, and these extreme points must in addition be probability measures with $P_{bary}$-probability 1. Hence $Q \in C$ is a barycenter of a probability measure over extreme points in $C$.

Let $Q$ be an element in $C$. Let $A_i$, $i = 1, 2, 3$ be a disjoint cover of $[0, \infty)$ and assume that $Q(A_i) > 0$. Then

$$Q = \sum_{i=1}^{3} Q(A_i)\, Q(\cdot \mid A_i).$$

For a probability vector $\lambda = (\lambda_1, \lambda_2, \lambda_3)$ let $Q_\lambda$ denote the distribution

$$Q_\lambda = \sum_{i=1}^{3} \lambda_i\, Q(\cdot \mid A_i).$$

Then $Q_\lambda$ is an element in $C$ if and only if

$$\sum_{i=1}^{3} \lambda_i \int_{A_i} X \, dQ(\cdot \mid A_i) \leq 1. \tag{9}$$

An extreme probability vector $\lambda$ that satisfies (9) has one or two of its weights equal to 0. Hence, if $Q$ is extreme in $C$ and $A_i$, $i = 1, 2, 3$ is a disjoint cover of $A$, then at least one of the three sets satisfies $Q(A_i) = 0$. Therefore an extreme point $Q \in C$ is of one of the following two types:

1) $Q$ is concentrated in one point.

2) $Q$ has support on two points. In this case the inequality $\int_A X \, dQ \leq 1$ holds with equality and $P(A) = 1$ so that $P$ is absolutely continuous with respect to $Q$ and therefore supported by the same two-element set.

The formulas for divergence are linear in $Q$. Hence any $(f,g)$-divergence pair is the barycenter of a probability measure $P_{bary}$ over points generated by extreme distributions $Q \in C$. The extreme distributions of type 2 generate 2-achievable points.




Fig. 3. The slashed curve connects $\mathbf{y}_1$ and $\mathbf{y}_2$. The lines $\ell_1$ and $\ell_2$ are not illustrated.



For extreme points $Q$ concentrated in a single point we can reverse the argument and make a barycentric decomposition with respect to $P$. If an extreme $P$ has a two-point support then $Q$ is absolutely continuous with respect to $P$ and generates an $(f,g)$-achievable point that is 2-achievable. If $P$ is concentrated in a point then this point may either be identical with the support of $Q$, in which case the two probability measures are identical, or the support points are different and $P$ and $Q$ are singular but still $(P,Q)$ is supported on two points. Therefore any $(f,g)$-achievable point has a barycentric decomposition into 2-achievable points.

Let $\mathbf{y}$ be an $(f,g)$-achievable point. As we have seen, $\mathbf{y}$ is a barycenter of $(f,g)$-achievable points that are 2-achievable. According to Carathéodory's theorem [12] any barycentric decomposition in two dimensions may be obtained as a convex combination of at most three points $\mathbf{y}_i$, $i = 1, 2, 3$, as illustrated in Figure 3. Assume that all three points have positive weight. Let $\ell_i$ be the line through $\mathbf{y}$ and $\mathbf{y}_i$. The point $\mathbf{y}$ divides the line $\ell_i$ in two half-lines $\ell_i^-$ and $\ell_i^+$, where $\ell_i^-$ denotes the half-line that contains $\mathbf{y}_i$. The half-lines $\ell_i^+$, $i = 1, 2, 3$ divide $\mathbb{R}^2$ into three sectors, each of them containing one of the points $\mathbf{y}_i$, $i = 1, 2, 3$. The set of $(f,g)$-divergence pairs that are 3-achievable is curve-connected so there exists a continuous curve of $(f,g)$-divergence pairs that are 2-achievable from $\mathbf{y}_1$ to $\mathbf{y}_2$, and this curve must intersect one of the half-lines $\ell_i^+$ in a point $\mathbf{z}$. If $\mathbf{z}$ lies on $\ell_i^+$ then $\mathbf{y}$ is a convex combination of the two points $\mathbf{y}_i$ and $\mathbf{z}$. Hence, any $(f,g)$-divergence pair is a convex combination of two points that are 2-achievable. From the construction in the proof of Theorem 6 we see that any $(f,g)$-divergence pair is 4-achievable.

An $f$-divergence on an arbitrary $\sigma$-algebra can be approximated by the $f$-divergence on its finite sub-algebras. Any finite $\sigma$-algebra is a Borel $\sigma$-algebra for a discrete space so for probability measures $P, Q$ on a $\sigma$-algebra the point $(D_f(P,Q), D_g(P,Q))$ is in the closure of the 4-achievable points. For any function pair $(f,g)$ the intersection of the set of 2-achievable points and the first quadrant is closed. 4-achievable points are convex combinations of 2-achievable points so the intersection of the 4-achievable points and the first quadrant is closed and contains $(D_f(P,Q), D_g(P,Q))$ even if $P, Q$ are measures on a non-atomic $\sigma$-algebra. ■




The set of $(f,g)$-achievable points that are 2-achievable can be parametrized by $P = (1-p, p)$ and $Q = (1-q, q)$. If we define $\overline{(1-p,p)} = (p, 1-p)$ then $D_f(\bar{P}, \bar{Q}) = D_f(P,Q)$. Hence we may assume without loss of generality that $p \leq q$ and just have to determine the image of the simplex $\Delta = \{(p,q) \mid 0 \leq p \leq q \leq 1\}$. This result makes it very easy to make a numerical plot of the $(f,g)$-achievable points that are 2-achievable, and the joint range is just their convex hull.
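The numerical recipe suggested by this parametrization can be sketched as follows: sample the triangle $p \leq q$, map each sample to its divergence pair, and take the convex hull of the resulting points. The code is an assumption-laden illustration rather than the authors' procedure. It uses total variation and $\phi_2$ as the example pair, a uniform grid that avoids $q \in \{0, 1\}$, and a textbook monotone-chain hull; all function names are hypothetical.

```python
def two_point_pair(f, g, p, q):
    """(D_f, D_g) for P = (1-p, p) and Q = (1-q, q) with 0 < q < 1."""
    def d(h):
        return (1 - q) * h((1 - p) / (1 - q)) + q * h(p / q)
    return d(f), d(g)

def sample_triangle(f, g, n=40):
    """Images of the grid points p = i/n <= q = j/n, restricted to 0 < q < 1."""
    pts = []
    for i in range(n + 1):
        for j in range(max(i, 1), n):
            pts.append(two_point_pair(f, g, i / n, j / n))
    return pts

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for pt in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], pt) <= 0:
            lower.pop()
        lower.append(pt)
    for pt in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], pt) <= 0:
            upper.pop()
        upper.append(pt)
    return lower[:-1] + upper[:-1]

def tv(t):
    return abs(t - 1.0)

def chi2(t):
    return 0.5 * (t - 1.0) ** 2

hull = convex_hull(sample_triangle(tv, chi2, n=20))
```

The hull of a finite sample only approximates the joint range from the inside; a finer grid, or samples concentrated near the boundary of the triangle, would be needed where the divergences grow without bound.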

III. Image of the triangle 

In order to determine the image of the triangle $\Delta$ we have to check what happens at inner points and what happens at or near the boundary. Most inner points are mapped into inner points of the range. On subsets of $\Delta$ where the derivative matrix is non-singular the mapping $(P,Q) \to (D_f, D_g)$ is open according to the open mapping theorem from calculus. Hence, all inner points that are not mapped into interior points of the range must satisfy

$$\begin{vmatrix} \dfrac{\partial D_f}{\partial p} & \dfrac{\partial D_g}{\partial p} \\ \dfrac{\partial D_f}{\partial q} & \dfrac{\partial D_g}{\partial q} \end{vmatrix} = 0.$$

Depending on the functions $f$ and $g$ this equation may be easy or difficult to solve, but in most cases the solutions will lie on a 1-dimensional manifold that will cut the triangle $\Delta$ into pieces, such that each piece is mapped isomorphically into subsets of the range of $(P,Q) \to (D_f, D_g)$. Each pair of functions $(f,g)$ will require its own analysis.

The diagonal $p = q$ in $\Delta$ is easy to analyze. It is mapped into $(D_f, D_g) = (0,0)$.

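Lemma 9: If $f(0) = \infty$, and $\liminf_{t \to 0} \frac{g(t)}{f(t)} = \beta_0$, then the supremum of

$$\beta \cdot D_f(P,Q) - D_g(P,Q)$$

over all distributions $P, Q$ is $\infty$ if $\beta > \beta_0$.

If $f^*(0) = \infty$, and $\liminf_{t \to \infty} \frac{g(t)}{f(t)} = \beta_0$, then the supremum of

$$\beta \cdot D_f(P,Q) - D_g(P,Q)$$

over all distributions $P, Q$ is $\infty$ if $\beta > \beta_0$.

If $g(0) = \infty$, and $\limsup_{t \to 0} \frac{g(t)}{f(t)} = \gamma_0$, then the supremum of

$$D_g(P,Q) - \gamma \cdot D_f(P,Q)$$

over all distributions $P, Q$ is $\infty$ if $\gamma < \gamma_0$.

If $g^*(0) = \infty$, and $\limsup_{t \to \infty} \frac{g(t)}{f(t)} = \gamma_0$, then the supremum of

$$D_g(Q,P) - \gamma \cdot D_f(Q,P)$$

over all distributions $P, Q$ is $\infty$ if $\gamma < \gamma_0$.

Proof: Assume that

$$f(0) = \infty \quad \text{and} \quad \liminf_{t \to 0} \frac{g(t)}{f(t)} = \beta_0.$$

The first condition implies

$$D_f((1,0),(1/2,1/2)) = \infty$$

and the second condition implies that $g(0) = \infty$ and

$$D_g((1,0),(1/2,1/2)) = \infty.$$

We have

$$\frac{D_g((p,1-p),(1/2,1/2))}{D_f((p,1-p),(1/2,1/2))} = \frac{g(2p)/2 + g(2(1-p))/2}{f(2p)/2 + f(2(1-p))/2} = \frac{g(2p) + g(2(1-p))}{f(2p) + f(2(1-p))}.$$

Let $(t_n)_n$ be a sequence such that $\frac{g(t_n)}{f(t_n)} \to \beta_0$ for $n \to \infty$. Then

$$\frac{D_g\left(\left(\frac{t_n}{2}, 1 - \frac{t_n}{2}\right), (1/2,1/2)\right)}{D_f\left(\left(\frac{t_n}{2}, 1 - \frac{t_n}{2}\right), (1/2,1/2)\right)} \to \beta_0$$

and the first result follows.

The other three cases follow by interchanging $f$ and $g$, and/or replacing $f$ by $f^*$ and $g$ by $g^*$. We have used that

$$\liminf_{t \to 0} \frac{g^*(t)}{f^*(t)} = \liminf_{t \to 0} \frac{t\, g(t^{-1})}{t\, f(t^{-1})} = \liminf_{t \to \infty} \frac{g(t)}{f(t)}.$$

Proposition 10: Assume that $f$ and $g$ are $C^2$ and that $f''(1) > 0$ and $g''(1) > 0$. Assume that $\liminf_{t \to 0} \frac{g(t)}{f(t)} > 0$, and that $\liminf_{t \to \infty} \frac{g(t)}{f(t)} > 0$. Then there exists $\beta > 0$ such that

$$D_g(P,Q) \geq \beta \cdot D_f(P,Q) \tag{10}$$

for all distributions $P, Q$.

Proof: The inequality $\liminf_{t \to 0} \frac{g(t)}{f(t)} > 0$ implies that there exist $\beta_0, t_0 > 0$ such that $g(t) \geq \beta_0 f(t)$ for $t < t_0$. The inequality $\liminf_{t \to \infty} \frac{g(t)}{f(t)} > 0$ implies that there exist $\beta_\infty > 0$ and $t_\infty > 0$ such that $g(t) \geq \beta_\infty f(t)$ for $t > t_\infty$. According to Taylor's formula we have

$$f(t) = \frac{f''(\theta)}{2}(t-1)^2, \qquad g(t) = \frac{g''(\eta)}{2}(t-1)^2$$

for some $\theta$ and $\eta$ between 1 and $t$ (here we use $f(1) = g(1) = 0$ and the normalization $f'(1) = g'(1) = 0$). Hence

$$\frac{g(t)}{f(t)} = \frac{g''(\eta)}{f''(\theta)} \to \frac{g''(1)}{f''(1)} \quad \text{for } t \to 1.$$

Therefore there exist $\beta_1 > 0$ and an interval $]t_-, t_+[$ around 1 such that $\frac{g(t)}{f(t)} \geq \beta_1$ for $t \in\, ]t_-, t_+[$. The function $t \to \frac{g(t)}{f(t)}$ is continuous on the compact set $[t_0, t_-] \cup [t_+, t_\infty]$ so it has a minimum $\tilde{\beta} > 0$ on this set. Inequality (10) holds for $\beta = \min\{\beta_0, \beta_1, \tilde{\beta}, \beta_\infty\}$. ■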

IV. Examples 

In this section we shall see a number of examples of how the method developed in this paper can be applied to determine the joint range for some pairs of $f$-divergences. Some of these results are known and others are new. We will not spell out all the details but shall restrict ourselves to the main flow of the argument that leads to the joint range.



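A. Power divergence of order 2 and 3

We have

$$f(t) = \phi_2(t), \qquad g(t) = \phi_3(t).$$

In this case we have

$$D_f((p,1-p),(q,1-q)) = \frac{1}{2}\left( \frac{(p-q)^2}{q} + \frac{(p-q)^2}{1-q} \right),$$

$$D_g((p,1-p),(q,1-q)) = \frac{1}{6}\left( \left(\frac{p}{q}\right)^3 q + \left(\frac{1-p}{1-q}\right)^3 (1-q) - 1 \right).$$

First we determine the image of the triangle. The derivatives are

$$\frac{\partial D_f}{\partial p} = \frac{p-q}{(1-q)q},$$

$$\frac{\partial D_f}{\partial q} = \frac{1}{2} \frac{(2pq - q - p)(p-q)}{(1-q)^2 q^2},$$

$$\frac{\partial D_g}{\partial p} = -\frac{3}{6} \frac{(2pq - q - p)(p-q)}{(1-q)^2 q^2},$$

$$\frac{\partial D_g}{\partial q} = \frac{2\left( pq + p^2 + q^2 - 3pq^2 - 3p^2 q + 3p^2 q^2 \right)(p-q)}{6 (q-1)^3 q^3}.$$

The determinant of derivatives is

$$\begin{vmatrix} \dfrac{\partial D_f}{\partial p} & \dfrac{\partial D_g}{\partial p} \\ \dfrac{\partial D_f}{\partial q} & \dfrac{\partial D_g}{\partial q} \end{vmatrix} = \frac{(p-q)^2}{12 q^4 (1-q)^4} \begin{vmatrix} 2 & 3p + 3q - 6pq \\ 2pq - q - p & 6pq^2 - 2p^2 - 2q^2 - 2pq + 6p^2 q - 6p^2 q^2 \end{vmatrix} = -\frac{1}{12}\left( \frac{p-q}{q(1-q)} \right)^4.$$

We see that the determinant of derivatives is different from zero for $p \neq q$ so the interior of $\Delta$ is mapped one-to-one to the image. Hence we just have to determine the image of points on the boundary of $\Delta$ (or near the boundary if undefined on the boundary).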
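The determinant of derivatives for the pair $(D_2, D_3)$ can be cross-checked with central finite differences. The sketch below assumes the two-point formulas $D_2 = \frac{1}{2}\left((p-q)^2/q + (p-q)^2/(1-q)\right)$ and $D_3 = \frac{1}{6}\left(p^3/q^2 + (1-p)^3/(1-q)^2 - 1\right)$ and compares against the closed form $-\frac{1}{12}\left(\frac{p-q}{q(1-q)}\right)^4$, which is what direct differentiation of these two expressions gives. The function names and the step size `h` are illustrative choices.

```python
def d2(p, q):
    """D_2((p, 1-p), (q, 1-q))."""
    return 0.5 * ((p - q) ** 2 / q + (p - q) ** 2 / (1 - q))

def d3(p, q):
    """D_3((p, 1-p), (q, 1-q))."""
    return (p ** 3 / q ** 2 + (1 - p) ** 3 / (1 - q) ** 2 - 1) / 6

def jacobian_det(p, q, h=1e-5):
    """Determinant of the derivative matrix, estimated by central differences."""
    dfp = (d2(p + h, q) - d2(p - h, q)) / (2 * h)
    dfq = (d2(p, q + h) - d2(p, q - h)) / (2 * h)
    dgp = (d3(p + h, q) - d3(p - h, q)) / (2 * h)
    dgq = (d3(p, q + h) - d3(p, q - h)) / (2 * h)
    return dfp * dgq - dfq * dgp

def closed_form(p, q):
    """-(1/12) * ((p - q) / (q (1 - q)))^4."""
    return -((p - q) / (q * (1 - q))) ** 4 / 12

checks = [(jacobian_det(p, q), closed_form(p, q))
          for p, q in ((0.2, 0.5), (0.1, 0.3), (0.4, 0.7))]
```

The determinant is strictly negative away from the diagonal $p = q$, which is the non-degeneracy used in the argument.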

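For $P = (1,0)$ and $Q = (1-q, q)$ we get

$$D_f(P,Q) = \frac{1}{2}\left( \frac{1}{1-q} - 1 \right),$$

$$D_g(P,Q) = \frac{1}{6}\left( \frac{1}{(1-q)^2} - 1 \right) = \frac{1}{6} \frac{(2-q)q}{(1-q)^2}.$$

The first equation leads to

$$\frac{1}{1-q} = 2 D_f + 1$$

and hence

$$D_g = \frac{2}{3} D_f (D_f + 1).$$

We have

$$\frac{g(t)}{f(t)} = \frac{\dfrac{t^3 - 3(t-1) - 1}{6}}{\dfrac{t^2 - 2(t-1) - 1}{2}} \to \infty \quad \text{for } t \to \infty.$$

All points $(0,s)$, $s \in [0,\infty)$ are in the closure of the range of $(P,Q) \to (D_f, D_g)$. By combining these two results we see that the range consists of the point $(0,0)$, all points on the curve $\left(x, \frac{2}{3} x(x+1)\right)$, $x \in (0,\infty)$, and all points above this curve.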
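The boundary curve $D_3 = \frac{2}{3} D_2 (D_2 + 1)$ can be verified directly from the power functions. In the sketch below, `two_point` evaluates $D_\alpha((p,1-p),(q,1-q))$, so the boundary case $P = (1,0)$, $Q = (1-q,q)$ corresponds to the call `two_point(alpha, 1.0, 1 - q)`. The names are hypothetical and only orders $\alpha \notin \{0,1\}$ are handled.

```python
def phi(alpha, t):
    """Power function of order alpha, for alpha not in {0, 1} and t >= 0."""
    return (t ** alpha - alpha * (t - 1) - 1) / (alpha * (alpha - 1))

def two_point(alpha, p, q):
    """D_alpha((p, 1-p), (q, 1-q)) for 0 < q < 1."""
    return q * phi(alpha, p / q) + (1 - q) * phi(alpha, (1 - p) / (1 - q))

def lower_curve(x):
    """Second coordinate of the boundary curve (x, (2/3) x (x + 1))."""
    return 2.0 * x * (x + 1.0) / 3.0

# Boundary case P = (1, 0), Q = (1 - q, q): the pair lies on the curve.
boundary = [(two_point(2, 1.0, 1 - q), two_point(3, 1.0, 1 - q))
            for q in (0.1, 0.5, 0.9)]

# Interior grid points: the pair lies on or above the curve.
interior = [(two_point(2, i / 10, j / 10), two_point(3, i / 10, j / 10))
            for i in range(1, 10) for j in range(1, 10)]
```

That interior points never fall below the curve also follows from Jensen's inequality, since $\sum_j p_j^3/q_j^2 = E_P[(p/q)^2] \geq (E_P[p/q])^2 = (1 + 2D_2)^2$.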

Similar results hold for any pair of power divergences, but for other pairs than $(D_2, D_3)$ the computations become much more involved.

Note that the Rényi divergences are monotone functions of the power divergences so our results easily translate into results on Rényi divergences. More details on Rényi divergences can be found in [13].

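B. Total variation and $\chi^2$-divergence

In this case we have

$$f(t) = |t - 1|, \qquad g(t) = \frac{1}{2}(t-1)^2.$$

The function $f$ is not differentiable but on the triangle $\Delta$ we have $p \leq q$ and

$$D_f(P,Q) = q \left| \frac{p}{q} - 1 \right| + (1-q) \left| \frac{1-p}{1-q} - 1 \right| = 2(q-p).$$

Hence $D_f(P,Q)$ is $C^\infty$ on $\Delta$ although $f$ is not differentiable. We get

$$\frac{\partial D_f}{\partial p} = -2, \qquad \frac{\partial D_f}{\partial q} = 2,$$

$$\frac{\partial D_g}{\partial p} = \frac{p-q}{(1-q)q}, \qquad \frac{\partial D_g}{\partial q} = \frac{(2pq - q - p)(p-q)}{2(1-q)^2 q^2}.$$

Hence

$$\begin{vmatrix} \dfrac{\partial D_f}{\partial p} & \dfrac{\partial D_g}{\partial p} \\ \dfrac{\partial D_f}{\partial q} & \dfrac{\partial D_g}{\partial q} \end{vmatrix} = \begin{vmatrix} -2 & \dfrac{p-q}{(1-q)q} \\ 2 & \dfrac{(2pq-q-p)(p-q)}{2(1-q)^2 q^2} \end{vmatrix} = -\frac{2 (q-p)^2 (q - 1/2)}{(1-q)^2 q^2}.$$

The mapping from $\Delta$ to the range of $(D_f, D_g)$ is singular for $q = 1/2$. The line $p \to (p, 1/2)$ is mapped into the curve

$$p \to (D_f(P,Q), D_g(P,Q)) = \left( 2(1/2 - p),\; 2(p - 1/2)^2 \right).$$

If the total variation is denoted $V$ this curve satisfies $\chi^2 = \frac{1}{2} V^2$ and points satisfying $\chi^2 > \frac{1}{2} V^2$ are 2-achievable. The inequality $\chi^2 \geq \frac{1}{2} V^2$ has been proved previously by a different method [14].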
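The bound $\chi^2 \geq \frac{1}{2} V^2$, with $\chi^2$ normalized by $g(t) = \frac{1}{2}(t-1)^2$, is not restricted to two-point distributions; it also follows from the Cauchy-Schwarz inequality. A small spot check on a few hand-picked probability vectors is sketched below. The vectors and helper names are arbitrary illustrations, and $q_j > 0$ is assumed wherever $p_j > 0$.

```python
def total_variation(P, Q):
    """V(P, Q) = sum_j |p_j - q_j|, the f-divergence with f(t) = |t - 1|."""
    return sum(abs(p - q) for p, q in zip(P, Q))

def chi2(P, Q):
    """(1/2) * sum_j (p_j - q_j)^2 / q_j; assumes q_j > 0 wherever p_j > 0."""
    return 0.5 * sum((p - q) ** 2 / q for p, q in zip(P, Q) if q > 0)

examples = [
    ([0.2, 0.3, 0.5], [0.4, 0.4, 0.2]),
    ([0.7, 0.1, 0.2], [0.1, 0.8, 0.1]),
    ([0.25, 0.25, 0.5], [0.3, 0.3, 0.4]),
]
slack = [chi2(P, Q) - 0.5 * total_variation(P, Q) ** 2 for P, Q in examples]

# Equality case from the singular line q = 1/2 of the two-point parametrization.
p = 0.2
equality_gap = chi2([p, 1 - p], [0.5, 0.5]) - 0.5 * total_variation([p, 1 - p], [0.5, 0.5]) ** 2
```

All entries of `slack` are nonnegative, and `equality_gap` vanishes up to rounding, matching the curve obtained for $q = 1/2$.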



C. Total variation and LeCam divergence

On the triangle $\Delta$ we have

$$D_f(P,Q) = 2(q-p),$$

$$D_g(P,Q) = \frac{1}{4}\left( \frac{(p-q)^2}{p+q} + \frac{(p-q)^2}{2-p-q} \right).$$

The derivatives of the LeCam divergence are

$$\frac{\partial}{\partial p} D_g(P,Q) = \frac{(p-q)(p + 3q - 2pq - 2q^2)}{(p+q)^2 (2-p-q)^2},$$

$$\frac{\partial}{\partial q} D_g(P,Q) = \frac{(2pq - q - 3p + 2p^2)(p-q)}{(p+q)^2 (p+q-2)^2}.$$

Hence

$$\begin{vmatrix} \dfrac{\partial D_f}{\partial p} & \dfrac{\partial D_g}{\partial p} \\ \dfrac{\partial D_f}{\partial q} & \dfrac{\partial D_g}{\partial q} \end{vmatrix} = \begin{vmatrix} -2 & \dfrac{(p-q)(p+3q-2pq-2q^2)}{(p+q)^2(2-p-q)^2} \\ 2 & \dfrac{(2pq-q-3p+2p^2)(p-q)}{(p+q)^2(p+q-2)^2} \end{vmatrix} = \frac{4(1-p-q)(q-p)^2}{(p+q)^2 (p+q-2)^2}.$$

The mapping is singular for $q = 1-p$. We get the curve

$$\left( 2((1-p) - p),\; \frac{1}{4}\left( \frac{(p-(1-p))^2}{p+(1-p)} + \frac{(p-(1-p))^2}{2-p-(1-p)} \right) \right).$$

If total variation is denoted $V$ then the curve is $D_g = \frac{1}{8} V^2$ and any point above this curve is achievable.

D. Information divergence and reversed information divergence

In this case we have

$$f(t) = t \ln t, \qquad g(t) = -\ln t.$$

We see that $g(0) = \infty$ and that $\frac{g(t)}{f(t)} \to \infty$ for $t \to 0$. Lemma 9 implies that the supremum of

$$D_g(P,Q) - \gamma D_f(P,Q) = D(Q\|P) - \gamma D(P\|Q)$$

over all distributions $P, Q$ is $\infty$ for any $\gamma < \infty$. Similarly the supremum of

$$D(P\|Q) - \gamma D(Q\|P)$$

over all distributions $P, Q$ is $\infty$ for any $\gamma < \infty$. Since $(0,0)$ is in the range and the range is convex, the range consists of all interior points of the first quadrant and the point $(0,0)$.
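For the pair total variation and LeCam divergence, the lower bound $D_{\mathrm{LeCam}} \geq \frac{1}{8} V^2$ can be spot-checked numerically. The sketch assumes the normalization in which the LeCam divergence of two probability vectors equals $\frac{1}{4} \sum_j (p_j - q_j)^2 / (p_j + q_j)$, which is what the symmetrized $\phi_2$-divergence reduces to. The helper names and test vectors are illustrative only.

```python
def total_variation(P, Q):
    """V(P, Q) = sum_j |p_j - q_j|."""
    return sum(abs(p - q) for p, q in zip(P, Q))

def lecam(P, Q):
    """(1/4) * sum_j (p_j - q_j)^2 / (p_j + q_j), skipping points with p_j = q_j = 0."""
    return 0.25 * sum((p - q) ** 2 / (p + q) for p, q in zip(P, Q) if p + q > 0)

examples = [
    ([0.2, 0.3, 0.5], [0.4, 0.4, 0.2]),
    ([1.0, 0.0, 0.0], [0.0, 0.5, 0.5]),
    ([0.6, 0.4], [0.1, 0.9]),
]
slack = [lecam(P, Q) - total_variation(P, Q) ** 2 / 8 for P, Q in examples]

# Equality on the singular line q = 1 - p of the two-point parametrization.
p = 0.3
P_eq, Q_eq = [p, 1 - p], [1 - p, p]
equality_gap = lecam(P_eq, Q_eq) - total_variation(P_eq, Q_eq) ** 2 / 8
```

The bound itself follows from the Cauchy-Schwarz inequality applied to $\sum_j |p_j - q_j|$ with weights $p_j + q_j$, so the nonnegative `slack` values are expected for any choice of vectors.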



